[fix] Using a pure Go parser for Python#2320
[fix] Using a pure Go parser for Python#2320linzhp wants to merge 3 commits intobazel-contrib:mainfrom
Conversation
|
FYI, @dougthor42, you might be using the feature of comments in the source code, would you have time to check this PR please? |
|
hmm, this only support Python 3.4 or below... Marking it as draft for now |
|
@linzhp Any update on this? TBH it kinda looks like github.com/go-python/gpython is defunct. It hasn't had any updates since Feb. I've run into another issue with the
I would be quite sad if gazelle annotations (comments in python files) were removed. That's the only way we've been able to deal with pre-existing circular imports within our code base when migrating to Bazel. |
|
I wonder if it would be easier to just make our own go parser for python? I've already done a bit while working on a hack for #2396. The key thing to note is that Gazelle only needs to know a very small subset of python grammar:
For the "happy path", comments and Things start to get tricky when:
For the record, here's some very-WIP pure go parser for the happy path above. This is what I'm playing with to address #2396 // file_parser.go
// A really really dumb python parser that's "good enough". Hopefully.
func (p *FileParser) naiveFileParser(code []byte) (*ParserOutput, error) {
codeStr := string(code[:])
// Parse imports
modules, err := p.naiveImportParser(codeStr)
if err != nil {
return nil, err
}
p.output.Modules = modules
// Parse comments
// TODO: find comment-in-string or comment-in-comment cases
commentPattern := regexp.MustCompile(`(?m)#.+$`)
commentStrings := commentPattern.FindAllString(codeStr, -1)
var comments []comment
for _, s := range commentStrings {
comments = append(comments, comment(s))
}
p.output.Comments = comments
// Parse 'if __name__' block
// TODO: Support other similar forms like b == a or single quotes
p.output.HasMain = strings.Contains(codeStr, `if __name__ == "__main__":`)
return &p.output, nil
}
func indexOf(element string, data []string) (uint32, error) {
for k, v := range data {
if element == v {
return uint32(k), nil
}
}
err := errors.New(fmt.Sprintf("Did not find %q in data.", element))
return 0, err
}
// naiveImportParser runs a very naive, mostly regex-based, parse of the python source code
// as a string.
// NOTE: This does not support multi-line imports like
// from foo import (
// bar,
// baz
// )
// Nor does this _really_ support comma-separated like
// from collections.abc import Callable, Iterator
func (p *FileParser) naiveImportParser(code string) ([]module, error) {
// Find `import a`, `from a import b`, `import a as b`, `from a import b as c`
// The pattern doesn't need to be bullet-proof; we'll remove things like comments later.
importPattern := regexp.MustCompile(`(?m)^(\w*import .+)|(\w*from .+ import .+)( as .+)?`)
importStatements := importPattern.FindAllString(code, -1)
// Also support Windows files >:-(
lines := strings.Split(strings.ReplaceAll(code, "\r\n", "\n"), "\n")
var modules []module
for _, v := range importStatements {
lineNum, err := indexOf(strings.TrimSpace(v), lines)
if err != nil {
return nil, err
}
// Remove any comments that might be present.
statement, _, _ := strings.Cut(v, "#")
statement = strings.TrimSpace(statement)
// At this point we have something like "import foo", "import foo as bar",
// "from foo import bar", or "from foo import bar as baz"
// So we need to "do the needful" to make the dotted import 'foo.bar'
dottedImport := ""
from := ""
// We don't care about aliases. Drop them.
statement, _, _ = strings.Cut(statement, " as ")
// 'from foo import bar' is the same as 'import foo.bar'. Because we don't need
// to support 'from foo import some_attribute'.
// dougthor42 is lucky here. A "real" implementation might need to worry about this.
if strings.HasPrefix(statement, "from ") {
first, second, _ := strings.Cut(statement[5:], " import ")
// second might be a list of items "from foo import A, B, c, d"
// If that's the case, we're (probably) importing Attributes so we don't
// need any of them.
// TODO: The more correct thing to do here would be to make N module structs.
// But for now, I'm lazy.
if strings.Contains(second, ",") {
second = strings.Split(second, ",")[0]
}
// "from foo import bar" --> "foo.bar"
dottedImport = fmt.Sprintf("%s.%s", first, second)
from = first
} else {
// "import foo.bar" --> "foo.bar"
dottedImport = statement[7:]
}
mod := module{
Name: dottedImport,
LineNumber: lineNum,
Filepath: p.relFilepath,
From: from,
}
modules = append(modules, mod)
}
return modules, nil
} |
|
Thanks for being interested. My next attempt would be using antlr to parse Python, but I haven't got there yet. Maybe you can give it a try?
Can you use |
The readme says that grammar only targets python 3.6 🫤
Sadly no, as that would apply at the Bazel package level and thus impact all targets. We need target-level (which for us are file-level) ignores I see a couple options:
|
Replacing tree-sitting with github.com/go-python/gpython, which is in pure Go. This re-enable the ability to cross-compile Gazelle from other platforms to macOS.
Note that gpython doesn't preserve Python comments, while Python extension read Gazelle directives in Python comments (not just BUILD.bazel file comments), I have to scan the Python file for a second time to find the comments, but this doesn't include inline comments. So this will break existing usage of Gazelle directives inline Python comments, which I think should be very rare. In fact, unless I miss anything, I don't think we should allow Gazelle directives in Python comments in the first place. They are designed to be in BUILD.bazel files.
Fixes #1913
@alexeagle FYI.